========================================================

R Markdown

Read the csv file

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

The csv file has been read and saved to a dataframe. The first 6 rows have been printed above.
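The loading step is not shown in this report, so here is a minimal sketch of how it might look. The file name `wineQualityReds.csv` is an assumption; `rwq` is the data frame name that appears in the correlation output later on. If the file is missing, the sketch falls back to a few columns of the rows printed above so that it still runs on its own.

```r
# Hypothetical file name; the real path is not shown in this report.
csv_path <- "wineQualityReds.csv"

if (file.exists(csv_path)) {
  rwq <- read.csv(csv_path)
} else {
  # Fallback: a subset of columns from the rows printed above.
  rwq <- read.csv(text = "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.70,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5
4,11.2,0.28,9.8,6
5,7.4,0.70,9.4,5
6,7.4,0.66,9.4,5")
}

# head() returns the first 6 rows unless told otherwise.
head(rwq)
```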

Looking for a summary of each column, its structure and names of all columns

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The summary output above gives statistics such as the mean, median and quartiles of every column in the red wine quality data. The quartile values hint at which columns could contain outliers. A point is treated as an outlier if it lies more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile. The columns I flagged as containing outliers based on this calculation are:

residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, fixed.acidity, chlorides, sulphates
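The 1.5 times interquartile range rule described above can be sketched as a small helper. The function name `is_outlier` is hypothetical, and the sample vector reuses the residual sugar values from the first rows plus the column maximum from the summary, purely for illustration.

```r
# Flag values beyond 1.5 * IQR from the quartiles.
# is_outlier is a hypothetical helper name, not part of the original analysis.
is_outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x < lower | x > upper
}

# First six residual.sugar values plus the column maximum (15.5).
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 15.5)
sugar[is_outlier(sugar)]
```

On a full data frame, `sapply(rwq, function(col) any(is_outlier(col)))` would report which columns contain at least one flagged point.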

The str command showed me the data type stored in each column along with some sample values. The names command gives me the names of all the columns. I can always use it to look up a column name if I cannot recall it during the project.

Now, I would like to check if we have any factor variables in our data.

Check for factor variables

##                    X        fixed.acidity     volatile.acidity 
##                FALSE                FALSE                FALSE 
##          citric.acid       residual.sugar            chlorides 
##                FALSE                FALSE                FALSE 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                FALSE                FALSE                FALSE 
##                   pH            sulphates              alcohol 
##                FALSE                FALSE                FALSE 
##              quality 
##                FALSE

It turns out that none of the columns are factor variables.
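A named logical vector like the one above is what `sapply()` with `is.factor` returns, so the check was probably something along these lines. The toy data frame and the column name `quality.factor` are assumptions for illustration only.

```r
# Toy stand-in for the wine data (values taken from the first rows above).
rwq <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4)
)

# One TRUE/FALSE per column: is it stored as a factor?
factor_flags <- sapply(rwq, is.factor)
factor_flags

# Optional: keep an ordered-factor copy of quality for plotting.
# quality.factor is a hypothetical column name.
rwq$quality.factor <- factor(rwq$quality, ordered = TRUE)
```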

Install ggplot2

Study the nature of each column

Quality

Quality is not a continuous variable. I first tried to plot it using the qplot and I figured that using the ggplot function makes it look much better and does not mislead us into thinking that it is a continuous variable like the other columns.
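The plotting code is not shown here. One way to get a bar chart that treats quality as discrete is to wrap it in `factor()` inside `ggplot()`. This is a sketch under that assumption, using the first ten quality values from the str output.

```r
library(ggplot2)

# First ten quality ratings from the str() output above.
rwq <- data.frame(quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# factor() makes ggplot2 draw one bar per rating instead of a continuous axis.
p_quality <- ggplot(rwq, aes(x = factor(quality))) +
  geom_bar() +
  labs(x = "Quality rating", y = "Number of wines")

p_quality
```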

Residual Sugar(with outliers)

Residual Sugar(without outliers)

Adding a scale_x_log10() made the distribution look a lot better. Also, limiting the data to be between 0g/dm^3 and 10g/dm^3 and adding the right breaks makes the plot more readable and gets rid of the outliers when compared to the original data.
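A sketch of the log-scaled histogram follows. The bin count and break positions are assumptions. Note that a log axis cannot start at exactly 0, so this sketch uses the column minimum (0.9) as the lower limit.

```r
library(ggplot2)

# First ten residual.sugar values from the str() output above.
rwq <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
)

# limits drop values outside the range; breaks control the tick labels.
p_sugar <- ggplot(rwq, aes(x = residual.sugar)) +
  geom_histogram(bins = 10) +
  scale_x_log10(limits = c(0.9, 10), breaks = c(1, 2, 3, 5, 10)) +
  labs(x = "Residual sugar (g/dm^3, log10 scale)", y = "Number of wines")

p_sugar
```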

Alcohol

Alcohol column has pretty much a normal distribution and no outliers to be worried about.

Fixed Acidity(with outliers)

Fixed Acidity(without outliers)

Limiting the data to be between 4g/dm^3 and 13g/dm^3 and adding the right breaks makes the plot more readable and gets rid of the outliers when compared to the original data.

Volatile Acidity

Volatile acidity column has pretty much a normal distribution and no outliers to be worried about.

Citric Acid

Citric acid column has a distribution which is slightly skewed to the right but no outliers to be worried about.

Chlorides(with outliers)

Chlorides(without outliers)

Adding a scale_x_log10() made the distribution look a lot better. Also, limiting the data to be between 0.05g/dm^3 and 0.2g/dm^3 and adding the right breaks makes the plot more readable and gets rid of the outliers when compared to the original data.

Density

Density column has pretty much a normal distribution and no outliers to be worried about.

pH

pH column has pretty much a normal distribution and no outliers to be worried about.

Sulphates(with outliers)

Sulphates(without outliers)

Limiting the data to be in between 0.25g/dm^3 and 1.25g/dm^3, and adding the right breaks makes this data look more readable and also gets rid of the outliers when compared to the original data.

Free Sulfur Dioxide(with outliers)

Free Sulfur Dioxide(without outliers)

Limiting the data to less than 50mg/dm^3 and adding the right breaks makes this data look more readable and also gets rid of the outliers when compared to the original data.

Total Sulfur Dioxide(with outliers)

Total Sulfur Dioxide(without outliers)

Limiting the data to less than 150mg/dm^3 and adding the right breaks makes this data look more readable and also gets rid of the outliers when compared to the original data.

What do we do next?

The quality column looks like the one which could be measured against a lot of other properties of wine. It looks like 5, 6 and 7 are the most common values for the quality rating of the wine.

Study of quality against other properties of red wine(without handling outliers)

Quality with Alcohol

There seems to be some sort of a linear relationship between alcohol and quality.

Quality with residual sugar

The first plot is the original data and the second one is after removing outliers. Even though the data is more spread out, there isn’t any obvious trend visible above.

Quality with Total Sulfur Dioxide

The first plot is the original data and the second one is after removing outliers. There seems to be some sort of a linear relationship between total sulfur dioxide and quality.

Quality with free sulfur dioxide

The first plot is the original data and the second one is after removing outliers. There is no obvious trend visible above between free sulfur dioxide and quality.

Quality with citric acid

There seems to be some sort of a linear relationship between citric acid and quality but it is not very obvious.

Quality with fixed acidity

The first plot is the original data and the second one is after removing outliers. There is no obvious trend visible above between fixed acidity and quality.

Quality with volatile acidity

There seems to be some sort of a linear relationship between volatile acidity and quality.

Quality with pH

There is no obvious trend visible above between pH and quality.

Quality with chlorides

The first plot is the original data and the second one is after removing outliers. There is no obvious trend visible above between chlorides and quality.

Quality with sulphates

The first plot is the original data and the second one is after removing outliers. There seems to be some sort of a linear relationship between sulphates and quality.

Upon removing outliers, the data is more spread out and clear but not all columns show trends with the red wine quality. The data trends too are better seen upon removing outliers. The quality of red wine against all the other characteristics shows that some columns seem to have a linear relationship with quality. They are alcohol, sulphates, volatile acidity and citric acid. The relationship is not very clear and obvious though. The data needs to be looked at from some other angle to be sure.

How could I transform the data so that it is easier to read?

I think, if I group the values of each of these characteristics and plot them against quality, I might be able to see a trend with increasing or decreasing values of each characteristic with wine quality. I need to perform functions like group by to do this.

Installing the required packages to perform functions like groupby, filter etc.


I have my dplyr package installed, so I will start by grouping the data by quality, finding the mean or median alcohol value for each quality group and plotting them against each other. If this looks like it could give me helpful information, I will do the same for the other characteristics.

Grouping alcohol values by quality

This gives me a dataset of 6 rows with the mean and median alcohol value for each quality rating and the number of records in each group(n). The plot does show a trend: higher quality ratings are associated with higher alcohol values. But I think I should try the other way round to see a more detailed picture. I will group quality values by alcohol values later.
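The grouping step described above can be sketched with dplyr as follows. The output column names `mean_alcohol` and `median_alcohol` are assumptions, and the toy data frame holds only the first ten rows, so it produces 3 groups rather than the 6 in the full data.

```r
library(dplyr)

# First ten alcohol and quality values from the str() output above.
rwq <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One row per quality rating, with mean, median and group size.
alcohol_by_quality <- rwq %>%
  group_by(quality) %>%
  summarise(
    mean_alcohol = mean(alcohol),
    median_alcohol = median(alcohol),
    n = n()
  ) %>%
  arrange(quality)

alcohol_by_quality
```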

Now, since I see a trend which is quite easy to understand how alcohol affects the wine quality, I will try the same for the rest of the properties.

Grouping all other characteristics by quality to see a direct impact

I initially thought that I would calculate mean and median and then choose one of them to plot against quality but I think that I should just use median. Mean can be kind of misleading because of the outliers.
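A quick illustration of why the median is the safer choice here: the sketch below takes the first six total sulfur dioxide values from the data and adds the column maximum (289) as a single outlier. The variable names are hypothetical.

```r
# First six total.sulfur.dioxide values, without and with the column maximum.
so2 <- c(34, 67, 54, 60, 34, 40)
so2_with_outlier <- c(so2, 289)

# How far does each statistic move when one extreme value is added?
mean_shift <- mean(so2_with_outlier) - mean(so2)
median_shift <- median(so2_with_outlier) - median(so2)

c(mean_shift = mean_shift, median_shift = median_shift)
```

The mean moves by roughly 34 units while the median moves by 7, which is the kind of distortion the paragraph above is worried about.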

I just created one big table containing the median value of each chemical characteristic for each value of wine quality. I will then use this single table to pull out one chemical characteristic at a time and plot it against the quality values.
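The report does not show how the table was built. One possible way, sketched here with base R's `aggregate()` rather than whatever the original used, is to take the median of every column per quality value after dropping the row index `X`. The name `medians_by_quality` is an assumption.

```r
# First ten rows of a few columns, taken from the str() output above.
rwq <- data.frame(
  X = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Drop the row index, then take the median of every other column per quality.
chem <- rwq[, names(rwq) != "X"]
medians_by_quality <- aggregate(. ~ quality, data = chem, FUN = median)

# Pull out one characteristic at a time for plotting.
medians_by_quality[, c("quality", "alcohol")]
```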

I will use a scatterplot for each and smoothen them out.

Median of fixed acidity with quality

Fixed acidity has a positive linear relationship with wine quality: quality increases with the median fixed acidity, although it looks like it levels off after a point.

Median of volatile acidity with quality

Volatile acidity has a strong negative linear relationship with wine quality: quality decreases as the median volatile acidity increases.

Median of citric acid with quality

Citric acid has some sort of a positive linear relationship with wine quality: quality increases gradually with the median citric acid value.

Median of residual sugar with quality

The data points are all scattered showing that there is no evident relationship between residual sugar and wine quality.

Median of chlorides with quality

There is a gradual decrease in quality with increase in chloride amount. The relationship is not very strong though as the smooth line goes up and down.

Median of free sulfur dioxide with quality

The data points are all scattered showing that there is no evident relationship between free sulfur dioxide and wine quality.

Median of total sulfur dioxide with quality

The data points are all scattered showing that there is no evident relationship between total sulfur dioxide and wine quality.

Median of density with quality

The data points are all scattered showing that there is no evident relationship between density and wine quality.

Median of pH with quality

There is a gradual decrease in quality with increase in pH. It is not a steep decline, so we cannot really say that there is a negative linear relationship. But there is a slight trend.

Median of sulphates with quality

There is a gradual increase in quality of wine with increase in sulphates. There is a positive linear relationship.

Some characteristics have a linear relationship whereas some plots are hard to read or deduce any kind of relationship.

Citric Acid and Sulphates have a positive linear relationship, the quality seems to gradually increase with increase in each of these properties. The relationship between Fixed Acidity and Quality too has somewhat of a positive linear kind of relationship.

Volatile Acidity, Chlorides and pH have a negative linear relationship. Density too shows a little bit of a negative linear relationship but it is not very strong.

Grouping alcohol values to see the relation between their mean and quality of wine

This data is noisier, but it gives more data points than the plots created before this. The above plot does show that quality gradually increases with alcohol content. We do see a fall at the extreme right of the graph. This may be because people do not really enjoy wine with an extremely high amount of alcohol(>14%). If we had some more data in this region, we could tell better.

I would like to do the same with the other characteristics to see if something interesting comes up.

Grouping residual.sugar values to see the relation between their mean and quality of wine

Residual sugar has outliers, so I will filter that data out before transforming it.
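The filter-then-group step might look like the sketch below. The cutoff `sugar_cutoff` is a hypothetical value chosen so that the toy data (the first ten rows) actually loses a row. The histogram section earlier used an upper limit of 10g/dm^3.

```r
library(dplyr)

# First ten residual.sugar and quality values from the str() output above.
rwq <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical cutoff for this toy data.
sugar_cutoff <- 5

# Drop the outliers first, then average quality per distinct sugar value.
quality_by_sugar <- rwq %>%
  filter(residual.sugar < sugar_cutoff) %>%
  group_by(residual.sugar) %>%
  summarise(mean_quality = mean(quality), n = n())

quality_by_sugar
```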

As seen with residual.sugar in the set of graphs earlier, there is no trend as such seen with quality. Majority of the values seem to be seen within the 5 and 6 quality range.

Grouping fixed.acidity values to see the relation between their mean and quality of wine

Fixed acidity has outliers, so I will filter that data out before transforming it.

With fixed.acidity, it looks like there is a small dip in quality but then after the fixed acidity crosses 7.5, the quality seems to be gradually increasing. So we can say that there is a relationship but it is not very strong.

Grouping volatile.acidity values to see the relation between their mean and quality of wine

Volatile acidity clearly has a negative linear relationship with the quality of wine. The gradient shows that people are not big fans of wines with higher volatile acidity.

Grouping total.sulfur.dioxide values and free.sulfur.dioxide values to see the relation between their mean and quality of wine

Both total and free sulfur dioxide have outliers, so I will filter that data out before transforming it.

Total sulfur dioxide too has a negative linear relationship with the quality of wine. There are a lot of data points where the total sulfur dioxide is between 100 and 150 and the mean quality is down at 5 and on the other side there are lot of data points where the mean quality is above 5 where the total sulfur dioxide is less than 100.

With free sulfur dioxide, there is a negative linear relationship but it is not very strong. There is a very gradual dip in quality with increase in free sulfur dioxide.

Grouping citric.acid values to see the relation between their mean and quality of wine

With citric acid, there is a positive linear relationship but it is not very strong. There is a very gradual increase in quality with increase in citric acid. The data does look more scattered in positive and negative directions as the citric acid increases but we see more data points on the positive side.

Grouping chlorides values to see the relation between their mean and quality of wine

With chlorides, there is a negative linear relationship but it is not very strong. There is a very gradual dip in quality with increase in chloride content. As with citric acid, the data looks more scattered in positive and negative directions as the chloride content increases, but we see more data points on the positive side. So I feel like it stabilises beyond a point. The quality between 5 and 6 becomes more consistent.

Grouping sulphates values to see the relation between their mean and quality of wine

With sulphates, there is a positive linear relationship with quality. There is a constant increase in quality until the sulphate value reaches 0.9, and then the data scatters. We have an almost equal number of data points in both directions and cannot say much about the quality increase/decrease once the sulphate value goes beyond 1.

Grouping pH values to see the relation between their mean and quality of wine

As seen with pH in the set of graphs earlier, there is no trend as such seen with quality. Majority of the values seem to be seen within the 5 and 6 quality range.

Grouping density values to see the relation between their mean and quality of wine

As seen with density in the set of graphs earlier, there is a slight decrease in quality with increase in density of wine. It was not very strongly visible in the earlier graph but the multiple data points in this graph show a gradual decrease in quality with increase in density. Even though a large chunk of values are in the 5 to 6 range, there are data points with lower density values which have high quality. That is not the case for higher density values.

Correlation coefficients between quality and other characteristics of the wine

## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  rwq$quality and rwq$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

Not surprisingly, alcohol seems to have the highest correlation coefficient with quality.
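The output above has the shape of `cor.test(rwq$quality, rwq$alcohol)` repeated per column. As a sanity check on how the pieces relate, the t statistic follows from the correlation and the degrees of freedom via t = r * sqrt(df) / sqrt(1 - r^2). The helper names `t_from_r` and `cor_with_quality` below are hypothetical.

```r
# t statistic for testing "true correlation is 0", given r and df = n - 2.
t_from_r <- function(r, df) {
  r * sqrt(df) / sqrt(1 - r^2)
}

# Reported values for quality vs alcohol: cor = 0.4761663, df = 1597.
# The result should be close to the t = 21.639 reported above.
t_alcohol <- t_from_r(0.4761663, 1597)
t_alcohol

# Collect the Pearson correlation of every column with quality in one vector.
cor_with_quality <- function(data, target = "quality") {
  others <- setdiff(names(data), target)
  sapply(data[others], function(col) cor(col, data[[target]]))
}

# Toy data: first ten rows of two columns from the str() output.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
sort(cor_with_quality(toy))
```

Sorting the resulting vector gives a compact ranking of the characteristics, which is easier to scan than ten separate test printouts.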

Since alcohol has a moderate linear relationship with the quality of the wine(highest among other characteristics), I think I should study the relation between alcohol and the other chemical properties of wine to see if there are any interesting trends.

First I wish to look at how alcohol values are distributed.

I will then rule out pH(-0.0577), free.sulfur.dioxide(-0.0506) and residual.sugar(0.0137) from my study of the relationship with alcohol, because they have really low correlation coefficients with quality and are no longer important to me. I will compare the rest of them with alcohol to see if there are any relations I can find here.

Study of relationship between alcohol content in wine and other chemical characteristics

Create a big table grouping by alcohol and medians of all other columns

Median fixed acidity with alcohol

The relationship between fixed acidity and alcohol is not linear. There seems to be a gradual increase in fixed acidity with decrease in alcohol but it is not very obvious. It is not right to conclude that they have a linear relationship.

Median volatile acidity with alcohol

The data points are extremely scattered to derive any relationship between volatile acidity and alcohol.

Median citric acid with alcohol

There seems to be a very gradual increase in citric acid with increase in alcohol even though there is an initial dip in citric acid. Overall, the relationship is not very clear.

Median chlorides with alcohol

This is interesting. Majority of the data points show a dip in alcohol amount with increase in chloride amount. So this is a negative linear relationship I see here.

Median total sulfur dioxide with alcohol

The data points are extremely scattered to derive any relationship between total sulfur dioxide and alcohol.

This is interesting too. Majority of the data points show a dip in alcohol amount with increase in density of wine. So this is a negative linear relationship I see here.

There is a gradual increase in sulphate amount with increase in alcohol but it is not very strong. So it is hard to say that there is a relationship here.

I did the above study because I thought that if some characteristic has a strong relation with alcohol, it is indirectly related to the quality of the wine too.

Chlorides and density are the only 2 characteristics which seem strongly related to alcohol content of wine. They both show a negative linear relationship. Chlorides is something we did discover earlier having a relationship with quality. Now we have density too which we can say that, as the density of the wine decreases, the quality of the wine increases.

Plotting other characteristics versus alcohol in a different way

The above plots do not give me any helpful information. The data is very noisy and scattered and is hard to deduce anything here.

Plotting multiple(more than 2) variables

Now I wish to see how the data looks when I plot 3 variables together. To begin with, I would like to choose quality and alcohol to be present all the time and I will change the third variable. I will also try to incorporate the removal of the required outliers.
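A three-variable plot of this kind can be sketched by putting the changing characteristic on the x axis, alcohol on the y axis and mapping quality to colour. Treating quality as a factor for the colour scale is an assumption, as is the choice of volatile acidity as the third variable here.

```r
library(ggplot2)

# First ten rows of three columns from the str() output above.
rwq <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Colour carries the third variable; factor() gives one colour per rating.
p_three <- ggplot(
  rwq,
  aes(x = volatile.acidity, y = alcohol, colour = factor(quality))
) +
  geom_point(alpha = 0.6) +
  labs(x = "Volatile acidity", y = "Alcohol (%)", colour = "Quality")

p_three
```

Swapping `volatile.acidity` for another column name reuses the same layout for each of the remaining characteristics.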

As seen in my earlier findings, residual sugar does not show any sort of trend with wine quality or its alcohol content. And so happened this time as well. The data points are scattered throughout the plot giving me nothing to conclude.

This is an easy plot to read, majority of the data points say that, with increase in percentage of alcohol and increase in the fixed acidity of the wine, the quality seems to be getting better.

This is an easy plot to read too. The turquoise, blue and pink spots(better wine quality) are concentrated on the left and upper side of the plot, which says that lower volatile acidity and higher alcohol percentage favour the quality of wine. As the volatile acidity increases, the quality of the wine gradually decreases, and this is evident from the graph. Higher volatile acidity is also associated with lower percentages of alcohol.

Here, the turquoise, blue and pink spots(better wine quality) are concentrated on the right and upper side of the plot, which says that higher citric acid and alcohol percentage give better wine quality. The lower left side of the graph has the majority of the data points showing lower wine quality, and they are associated with lower alcohol percentage and citric acid content.

Here, the turquoise, blue and pink spots(better wine quality) are concentrated on the right and upper side of the plot, which says that higher sulphate and alcohol percentage give better wine quality. The lower left side of the graph has the majority of the data points showing lower wine quality, and they are associated with lower alcohol percentage and sulphate content.

The data here is slightly noisy. There is a very light trend and we can say that this relationship is not very strong. Lower chloride content and higher alcohol percentage show better quality in quite a few cases but not most, and the same holds for the opposite.

As seen in my earlier findings, free sulfur dioxide does not show any sort of trend with wine quality or its alcohol content. And so happened this time as well. The data points are scattered throughout the plot giving me nothing to conclude.

As seen in my earlier findings, total sulfur dioxide does not show any sort of trend with wine quality or its alcohol content. And so happened this time as well. The data points are scattered throughout the plot giving me nothing to conclude.

This data looks most interesting. pH data did not stand out so far with alcohol or wine quality. But when all 3 were put together, there seems to be a trend. If a line were to be drawn from the bottom left of the plot to the top right of the plot, the data points would look quite divided into these 2 sections. The upper part contains data points with higher wine quality and the lower part contains data points with lower wine quality.

Final Plots and Summary

1.Grouping alcohol values by quality

The characteristic of wine most strongly associated with its quality is the percentage of alcohol. Quality rises steadily with the percentage of alcohol. The relationship is positive and moderate (correlation coefficient 0.476), the strongest among all the characteristics.

2.Study of relationship between alcohol content in wine and other chemical characteristics

As shown above, alcohol has the strongest relationship with wine quality. Along with alcohol, I also derived that other characteristics like fixed.acidity, volatile.acidity, citric.acid and sulphates had a linear relationship with wine quality too, but not as strong as the percentage of alcohol. The other characteristics did not show anything interesting, so I decided to plot alcohol data with the other characteristics and see if there could be an indirect connection.

Chlorides seems to be strongly related to alcohol content of wine. They have a negative linear relationship. Chlorides was not showing an obvious trend with wine quality, but since it has this negative linear relationship with percentage of alcohol, I can conclude that it would have a negative linear relationship with wine quality too(as percentage of alcohol has a positive linear relationship with wine quality).

3.Plotting multiple(more than 2) variables

I plotted more than 2 variables this time. But I kept percentage of alcohol and wine quality constant. I studied all the other characteristics with these 2 variables to see what came up that was different from the findings I made earlier.

The most interesting finding was that of the relationship with pH. pH so far did not come close to showing any trend with wine quality or percentage of alcohol. But when I plotted it with both of them in the picture, it showed some relationship. In the above graph, the majority of the points above the line belong to a higher quality category and the majority of the points below the line belong to a lower quality category. It looks like lower values of pH and higher values of alcohol percentage show better wine quality, and higher values of pH with lower values of alcohol percentage show lower wine quality.

REFLECTION

I have studied red wine data to some extent where I felt that I understood red wine a lot better now although there is scope for a lot more exploration here.

My primary focus was on the quality of red wine and how the other properties of wine played a role in determining its quality. Quality did feel like the most important column in this data.

Alcohol content is the column which showed the strongest relationship with quality. However, there were some other columns too which shared a relationship but not as strong. Some examples were columns like acidity, citric acid and sulphates.

If I had to further explore this data, I would like to analyse each of these chemical characteristics with one another.

It was an interesting struggle for me to think about how to study this data beyond just the quality column. I think that adding data like types of red wine, countries of origin, types of grapes used etc. could make the study more interesting. I also did a lot of trials in choosing what columns to put on what axes and how to make the data look sensible. Overall, I felt that the more time and thought I give, the more ideas I can come up with to look at this data from different angles and come to different conclusions.